Abstract

Background: Logos are commonly used in molecular biology to provide a compact graphical representation of the 
conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile 
hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to 
the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter 
at that position. 

Results: We present a new tool and web server, called Skylign, which provides a unified framework for creating logos 
for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a 
novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection 
of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant 
sequences and by combining observed counts with informed priors. It also simplifies the representation of gap 
parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position. 

Conclusion: Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software 
package for download. Skylign's interactive logos are easily incorporated into a web page with just a few lines of HTML 
markup. Skylign may be found at http://skylign.org. 
Background

Alignments and profile hidden Markov models 

Alignments of multiple biological sequences play an im- 
portant role in a wide range of bioinformatics applica- 
tions, and are used to represent sequence families that 
range in size from DNA binding site motifs to full 
length proteins, ribosomal RNAs, and autonomous 
transposable elements. In an alignment, sequences are 
organized such that each column contains amino acids 
(or nucleotides) related by descent or shared functional 
constraint. The distributions of letters will typically vary 
from column to column. These patterns can reveal im- 
portant characteristics of the sequence family, for ex- 
ample highlighting sites vital to conformation or ligand 
binding. 
A sequence alignment can be used to produce a profile
hidden Markov model (profile HMM). Profile HMMs 
provide a formal probabilistic framework for sequence 
comparison [1-3], leveraging the information contained 
in a sequence alignment to improve detection of dis- 
tantly related sequences [4,5]. They are, for example, 
used in the annotation of both protein domains [6-9] 
and genomic sequence derived from ancient transpos- 
able element expansions [10]. 

Consider a family of related sequences, and an align- 
ment of a subset of those sequences. For each column, we 
can think of the observed letters as having been sampled 
from the distribution $\vec{p}$ of letters at that position among
all members of the sequence family. One approach to estimating
$\vec{p}$ for a column is to compute a maximum likelihood
estimate directly from observed counts at that
column. An alternative is to try to improve the estimate 
using sequence weighting (relative [11] and absolute [12]) 
Table 1 Relationship between DNA letter distribution and
information content
Values assume a 4-letter DNA alphabet with a uniform background
distribution. The maximum information content under these conditions is 2.0, 
for an invariant distribution. The minimum value is 0.0, achieved when the 
letter distribution matches the background. Note that small perturbations 
away from invariance result in large reductions in information content. 



and mixture Dirichlet priors [2,13-15]. The latter approach
is used in computing position-specific letter distributions
for profile HMMs [16,17].
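
The difference between the two estimation approaches can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: it uses a single Dirichlet component with made-up pseudocounts, whereas the text describes mixture Dirichlet priors combined with sequence weighting. The column, the function names, and the `alpha` values are all hypothetical.

```python
from collections import Counter

DNA = "ACGT"

def column_counts(column, alphabet=DNA):
    """Count letters in one alignment column, ignoring gap characters."""
    counts = Counter(ch for ch in column.upper() if ch in alphabet)
    return [counts[a] for a in alphabet]

def max_likelihood(counts):
    """Maximum likelihood estimate: observed relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean under ONE Dirichlet prior with pseudocounts alpha.

    Assumption: a single-component prior stands in for the mixture
    Dirichlet priors mentioned in the text.
    """
    total = sum(counts) + sum(alpha)
    return [(c + a) / total for c, a in zip(counts, alpha)]

column = "AAAAA-AAGA"  # hypothetical column: eight A, one G, one gap
counts = column_counts(column)
p_ml = max_likelihood(counts)
p_prior = dirichlet_posterior_mean(counts, alpha=[0.5, 0.5, 0.5, 0.5])

for letter, ml, post in zip(DNA, p_ml, p_prior):
    print(f"{letter}: ML={ml:.3f}  with prior={post:.3f}")
```

The maximum likelihood estimate assigns zero probability to letters that were never observed (C and T here), while the prior-based estimate keeps a small non-zero probability for them.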

In an alignment, a subset of the columns will be con- 
sensus columns, in which most sequences are repre- 
sented by a letter, rather than a gap character. In a 
typical profile HMM, a model position is created for 
each consensus column, and non-consensus columns 
are treated as insertions relative to model positions. As 
with letters, the per-position gap distributions may be 
estimated from observed or weighted counts, or com- 
bined with a Dirichlet prior. 



Logos 

A logo provides a compact graphical representation of 
an alignment, representing each column with a stack of 
letters. The total height of each stack corresponds to a 
measure of the invariance of the column - typically, it is 
the information content of that position. The height of 
each letter within a stack depends on the frequency of 
that letter at that position. Logos were originally devised 
to represent the extent of letter conservation in each 
column of an alignment [18,19], and were later general- 
ized to show letter and gap probabilities of a profile 
HMM [20]. 

Consider an alphabet A consisting of L letters, $a_1$
through $a_L$ (L is 4 for DNA, and 20 for amino acids).
For a given column in an alignment, we capture the estimated
column distribution as a length-L vector $\vec{p}$, such
that $p_i$ is the probability of observing letter $a_i$ at that
column. We define the length-L vector $\vec{q}$ to be the
background distribution over letters in A, such that $q_i$ is the
background probability of observing letter $a_i$, typically
based on letter frequency in a large set of representative
sequences.

Given $\vec{p}$ and $\vec{q}$, the information content [18] of the
column, also called relative entropy or Kullback-Leibler
distance [15,21], is defined as:

$$D(\vec{p} \,\|\, \vec{q}) = \sum_{i=1}^{L} p_i \log\left(\frac{p_i}{q_i}\right) \qquad (1)$$

When the base of the log is 2, the information content 
is expressed in bits. This value indicates the extent to 
which a column's distribution $\vec{p}$ differs from the
background $\vec{q}$, and serves as a measure of the conservation
of the column. Information content is non-negative, 
[Figure 1 annotations: stack height is the relative entropy of the column; letter height divides stack height according to letter frequency.]



Figure 1 Example profile logo. This logo shows positions 64 to 81 of the Peptidase_C14 profile HMM from Pfam (PF00656, Pfam 27.0), 
produced using Skylign. The profile HMM was constructed using hmmbuild (default parameters) from HMMER 3.1 on the Pfam seed alignment. 
One of the active sites of this Caspase domain is found at position 75. This site is invariant in active peptidases, but not in this profile HMM. 
This is the result of two forces: (1) the Pfam alignment includes non-peptidase homologs, which do not contain a Histidine at this position, and 
(2) HMMER intentionally drives down the information content per position (using an approach called entropy weighting [12]) to increase 
sensitivity to remote homologs. 
Figure 2 Compressed overview. The overview compresses letters into thin vertical bars to enable visualization over a wide section of a logo.
Here, we show positions 51 to 160 of the HMM shown in Figure 1. The color of each bar matches the color of the letter represented by that bar.
The active sites of this Caspase domain are found at positions 75 (Histidine, shown as the tall brown bar surrounded by yellow bars to the left) 
and 127 (Cysteine, shown as the taller turquoise bar to the right). The positions around 150 are located in a surface-exposed loop, so it is not 
surprising to see that they have low sequence conservation and occupancy, and that some have a high insert probability and expected insert 
length. Below the logo is a table of residue and gap probabilities for position 75, shown as a result of clicking on the corresponding stack - it 
shows that Histidine (H) is most common, followed by Lysine (K) and Arginine (R). 



largest when a column is invariant, and especially large 
when the invariant letter is rare in $\vec{q}$. For example, the
maximum information content for one column in a
DNA alignment under uniform background distribution
is 2 bits. The maximum for an amino acid alignment
under the background corresponding to the BLOSUM62
scoring matrix is roughly 6.5 bits - this is for an invariant
column of Tryptophan, which has the lowest back- 
ground probability. Table 1 shows examples of informa- 
tion content values for a few DNA letter distributions, 
to give some insight into the complex relationship be- 
tween information content and letter frequencies. 
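
As a rough illustration of Equation 1, the following Python sketch computes information content in bits for a few DNA letter distributions under a uniform background. The distributions are made-up examples, not the entries of Table 1, and the function name is hypothetical.

```python
import math

def information_content(p, q):
    """Relative entropy D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25] * 4  # uniform background over A, C, G, T

examples = {
    "invariant": [1.0, 0.0, 0.0, 0.0],
    "slightly perturbed": [0.97, 0.01, 0.01, 0.01],
    "two letters": [0.5, 0.5, 0.0, 0.0],
    "background": uniform,
}

for name, p in examples.items():
    print(f"{name:>20}: {information_content(p, uniform):.2f} bits")
```

The invariant column reaches the 2-bit maximum, the column matching the background scores 0 bits, and moving only 3% of the probability mass away from the invariant letter already costs roughly a quarter of a bit.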

For a conventional logo, a stack's height is spread
among the letters in alphabet A based on $\vec{p}$, such that
the height of each letter $a_i$ within a stack is $p_i \cdot D(\vec{p} \,\|\, \vec{q})$.
Letters are sorted such that those with larger $p_i$ appear
near the top in the stack. An example is shown in
Figure 1.
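
The conventional letter-height rule can be sketched in a few lines of Python. This is an illustrative sketch rather than Skylign's implementation; the function name and the example distribution are assumptions.

```python
import math

def letter_heights(p, q, alphabet):
    """Return (letter, height) pairs for one stack, bottom letter first.

    Stack height is D(p || q) in bits; each letter gets p_i times that height.
    """
    stack = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    heights = [(a, pi * stack) for a, pi in zip(alphabet, p)]
    # Sort ascending so the most probable letter comes last, i.e. on top.
    return sorted(heights, key=lambda pair: pair[1])

# Hypothetical DNA column under a uniform background.
stack = letter_heights([0.7, 0.1, 0.1, 0.1], [0.25] * 4, "ACGT")
for letter, height in stack:
    print(f"{letter}: {height:.3f} bits")
```

Because the letter probabilities sum to 1, the letter heights always add up to the information content of the column.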



Implementation 

We present a software tool and associated web service, 
called Skylign, which offers several advantages over 
existing logo tools. It can generate both a static image 
file and a new interactive web plot that supports scrol- 
ling, zooming, and inspection of values underlying each 
letter stack. Skylign also produces a simplified represen- 
tation of per-position gap probabilities, and optionally 



reduces visual clutter by including only overrepresented 
letters in a stack. Skylign's interactive logos are robust
and fast for alignments with length in the thousands, 
such as those representing many transposable element 
families. 

An important implementation detail is that Skylign 
produces logos for both profile HMMs and multiple 
sequence alignments in a unified framework. Profile 
logos are plotted using the per-position distributions of 
the profile HMM. For alignment logos, the column
distributions can be estimated from observed counts,
from weighted counts, or from posterior probabilities
after combining with a Dirichlet mixture prior.
Estimation based on weights and priors is performed by
explicitly producing a profile HMM using the hmmbuild tool
within HMMER 3.1 [17].

In the following sections, we describe implementation 
details, compare alternative visualization approaches, 
and illustrate the utility of these logos. Skylign can be 
accessed as a web service at http://skylign.org, and the 
Skylign software package may be downloaded for inde- 
pendent installation. 

Results and discussion 

Several logo web servers have been released since the 
introduction of logos [20,22-24], each with their own 
enhancements to logo presentation. In the course of 
developing websites for sequence homology search and 
Figure 3 Example of profile logos from LogoMat and Skylign. Particular focus is placed on gap parameter visualization, using a section of
the Pfam Pkinase protein domain (Pfam accession PF00069). Positions 137 and 148 both have high probability of being followed by an insertion 
(29% and 53%, respectively), with modest expected insert length (7.7 and 4.5, respectively). Position 148 also has a low occupancy (70%). (A) The 
logo produced by LogoMat, which represents gap parameters by stretching in the horizontal plane. Positions with low occupancy are given less 
horizontal space, but this is difficult to see. Insert rate and expected length are represented with variable-width red and pink columns. (B) Skylign 
simplifies visual interpretation of gap parameters by presenting a three row table beneath the letter stacks. The top row shows occupancy, with 
stronger blue background indicating lower occupancy. The middle row presents insert probability, and the bottom row shows expected insert 
length. For both insert rows, a stronger red background indicates higher values. Note that the default expected insert length (1.9) depends on 
the priors used when constructing the HMM; observed shorter or longer inserts can shift the expected length away from this value. When a cell 
in the middle row (insert probability) is not white, a thin red vertical bar of matching color is drawn immediately after the position for that cell, 
indicating that the insertion will produce letters between the neighboring positions. 



annotation, we identified a need for interactive web- 
enabled logos that could efficiently render very long 
logos, and offer alternate letter height options, im- 
proved visualization of per-position gap parameters, 
and the ability to inspect underlying values. We devel- 
oped Skylign to meet these needs. 

Web-enabled interactive logos 

Historically, all logo software has produced static images 
(e.g. png or vector graphics files). These are the appropri- 
ate formats for inclusion in manuscripts and slides, and 
may be produced with Skylign, but are suboptimal for dis- 
tribution on the web. For website integration, Skylign im- 
plements interactive logos that support navigation to a 
requested position in the logo, scroll smoothly, and can be 
zoomed out for a compressed overview of several hundred 
positions of the logo. Because profile HMMs create posi- 
tions only for consensus columns, and because a logo 
stack is defined only for non-empty alignment columns, 



not all columns in an alignment will be represented by a 
position in a profile logo; Skylign optionally shows the 
mapping between each logo position and the correspond- 
ing column in the underlying alignment. Skylign logos also 
support clicking on individual letter stacks to view the 
underlying values for all letters, as seen in Figure 2. 

The data used to produce an interactive logo is stored 
as a JSON object, which is rendered using HTML5 
Canvas and a custom JavaScript module. Adding a 
Skylign interactive logo to a web page is simple, requiring
the addition of a few lines of markup to the page
and a reference to the Skylign JavaScript and CSS files.

Skylign may be used in a variety of ways to create an 
image or interactive logo. The simplest option is to use 
the website submission form. Skylign also offers a web 
service via a RESTful interface [25], enabling scripted 
logo creation. Finally, the Skylign package may be down- 
loaded for local installation. Instructions for all of these 
options are available at http://skylign.org. 
Figure 4 Comparison of alternate methods of producing an alignment logo from an input sequence alignment. Logos were built using
the Rfam family seed alignment of 45 7SK sequences (RF00100, Rfam 11.0 [26]). Skylign performs necessary counting, weighting, and mixing by
explicitly building a profile HMM using the HMMER tool hmmbuild. (A) Logo for the alignment based on observed frequencies ('hmmbuild --symfrac
0 --wnone --enone --pnone'). Positions 105 and 106 highlight the fact that Skylign creates a logo position for every non-empty column in the alignment.
Occupancy is ~2% because only one of 45 sequences contains a letter at each of those positions. Stacks are tall because stack height depends on
observed counts, and there is no variability at these positions. (B) Logo for the alignment after applying sequence weighting [11] to account for
sequence redundancy ('hmmbuild --symfrac 0 --pnone'). The letter G at the first visible position of the weighted logo indicates much less conservation
than does the G for the logo based on observed counts. This is because most of the support for high conservation of G at that position comes from a
large set of highly similar sequences, and the importance of such redundant sequences is diminished under sequence weighting. Sequence weighting
can also alter the represented occupancy rates, for example showing a weighted 7% occupancy for positions 105 and 106. (C) Logo for the alignment
after applying sequence weighting, absolute weighting [12], and Dirichlet priors. This amounts to building a profile HMM under default HMMER
conditions, except that a match state is created for every non-empty column in the alignment ('hmmbuild --symfrac 0'). In the case of low weighted
counts, as in positions 105 and 106, HMMER's priors typically increase letter variance, leading to lower information content. (D) Logo for the HMM
built using default 'hmmbuild', in which logo positions are created only for consensus columns, resulting in removal of positions 105 and 106.



Position-specific gap parameters 

In addition to representing the letter distribution at each 
position, Skylign renders position-specific gap parame- 
ters. It does this by presenting up to three values for 
each position k: 

1. Occupancy: the probability of observing a letter at
position k. If we call this value occ(k), the probability
of observing a gap character (part of a deletion
relative to the model) is 1 - occ(k).

2. Insert probability: the probability of observing one
or more letters inserted between the letter
corresponding to position k and the letter
corresponding to position (k+1).

3. Insert length: the expected length of an insertion
following position k, if one is observed. For
mathematical convenience, profile HMMs model
insertions as having a geometric length distribution
with position-specific parameter $\varepsilon$ and mean length
$1/(1-\varepsilon)$.

The latter two are only relevant for profile logos, since
Skylign creates a logo position for each non-empty col- 
umn in the alignment when producing an alignment 
logo. 
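
The relationship between the geometric parameter and the expected insert length can be checked numerically. The sketch below assumes the usual reading of the geometric model: an insertion of length n extends n-1 times with probability ε and then stops with probability 1-ε. Function names are hypothetical.

```python
def expected_insert_length(epsilon):
    """Mean of a geometric length distribution with extension probability epsilon."""
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    return 1.0 / (1.0 - epsilon)

def extension_probability(mean_length):
    """Invert the mean: epsilon = 1 - 1/mean_length (mean_length >= 1)."""
    if mean_length < 1.0:
        raise ValueError("an insertion has at least one letter")
    return 1.0 - 1.0 / mean_length

def length_probability(n, epsilon):
    """P(insert length = n) for n >= 1: extend n-1 times, then stop."""
    return (1.0 - epsilon) * epsilon ** (n - 1)

# The Figure 3 caption mentions a default expected insert length of 1.9.
eps = extension_probability(1.9)
print(f"epsilon for mean length 1.9: {eps:.3f}")

# Numerical check of the closed form against a truncated sum.
approx_mean = sum(n * length_probability(n, eps) for n in range(1, 500))
print(f"mean from truncated sum: {approx_mean:.3f}")
```

The truncated sum agrees with the closed form 1/(1-ε), which is why a single parameter per position is enough to report an expected insert length.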

The tool LogoMat [20] generalized alignment logos to 
present these gap parameters for profile HMMs. In 
LogoMat, occupancy is represented by varying the width 
of the letter stacks (the stack is thinner for positions 
with lower occupancy). The insertion probability and ex- 
pected length are represented by placing variable-width 
two-toned columns between each letter stack, where the 
width of the darker part of the column corresponds to 
the insert rate and the width of the lighter part conflates 
expected length with insert rate. The result is that gap 
information is encoded by stretching the horizontal 
plane. As seen in Figure 3A, column width differences 
are difficult to discern. In Skylign, stack spacing is uni- 
form, and these parameters are instead represented by 



up to three rows of numerical values placed below the 
letter stacks of the logo, with a heat map laid over the 
top of each value to provide a visual aid. See Figure 3B 
for an example. This approach - pulling gap information 
into a distinct section below the letter stacks - bears some 
similarity to the approach used in the SUPERFAMILY 
database [6], and simplifies visualization of gap parameters. 

Unified framework for profile logos and alignment logos 

Skylign can produce a profile logo based on a profile 
HMM, or an alignment logo based on a sequence align- 
ment, both sharing the same interface. Generating a pro- 
file logo is a straightforward matter: a profile HMM 
stores estimated letter and gap parameters based on the 
underlying sequence alignment. Skylign simply extracts 
these values for use in computing stack heights, letter 
heights, and gap-related values. Alignment logos are 
more flexible, since Skylign offers four methods for com- 
puting estimated distributions from observed frequen- 
cies, demonstrated in Figure 4. For all methods, Skylign 
uses the hmmbuild tool from HMMER 3.1 to compute 
letter and gap values, with alternate option flags used for 
each method. 
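
The four option sets listed in the Figure 4 caption can be collected in a small lookup table. The Python sketch below only assembles command lines and does not run them. The dictionary keys, function name, and file names are made-up labels, the flag spellings follow HMMER's documented hmmbuild options, and how Skylign itself invokes hmmbuild internally is not described here.

```python
# Option sets per method, following the Figure 4 caption (panels A to D).
# The keys are hypothetical labels chosen for this sketch.
HMMBUILD_OPTIONS = {
    "observed": ["--symfrac", "0", "--wnone", "--enone", "--pnone"],  # (A)
    "weighted": ["--symfrac", "0", "--pnone"],                        # (B)
    "hmm_all_columns": ["--symfrac", "0"],                            # (C)
    "hmm_consensus": [],                                              # (D) defaults
}

def hmmbuild_command(method, hmm_out, alignment):
    """Build an argument list of the form: hmmbuild [options] <hmm_out> <msa>."""
    return ["hmmbuild", *HMMBUILD_OPTIONS[method], hmm_out, alignment]

# Hypothetical file names, for illustration only.
for method in HMMBUILD_OPTIONS:
    print(" ".join(hmmbuild_command(method, "family.hmm", "family.sto")))
```

Keeping the options in one table makes it explicit that the four alignment logos differ only in which weighting and prior steps are switched off, and in whether every non-empty column becomes a position.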

Logo height options 

In the case of protein sequences, when observed counts 
are combined with a strong Dirichlet mixture prior, the 
posterior letter distribution often contains small but 
non-negligible probabilities for all 20 letters. This results
in an illegible smear of letters at the bottom of the letter 
stack of the typical logo (Figure 5A). To address this, 
Skylign offers an alternate method of computing letter 
heights for a position, in which the only letters shown in 
each stack are those with above-background probability. 
Given the column letter distribution $\vec{p}$ and the
background distribution $\vec{q}$, the score of letter $a_i$ in that
column is its log odds ratio, $s_i := \log_2(p_i/q_i)$. Letters with
above-background probability will have a positive score.
Figure 5 Comparison of alternate methods of computing letter height within a stack. These examples were built using the profile HMM
for Peptidase_C14 Pfam protein domain family (PF00656), built using the hmmbuild tool from HMMER 3.1. (A) Information content (all): letter stack
height is the information content of the column, and all letters subdivide that stack height according to their probability. For models built using
strong priors, as with HMMER 3.1, it is common to see an unreadable clutter of below-background letters at the bottom of the stack. See an
example under the prominent D and Q at position 125. (B) Information content (above-background): a less noisy variant, in which stack height is also
based on information content, and that height is divided only among letters with above-background probability. Notice the reduced letter clutter
in position 125. (C) Score: a variant in which a letter's height depends on the score of that letter at that position. Only positive-scoring letters
(those with above-background probability) are included in the stack. In this case, the height of a stack does not have any inherent meaning - it
is simply the sum of all letter heights. As an example of this, note that the stack for position 126 is slightly taller than the stack for position 125,
despite the fact that position 125 is much more conserved, as seen in Figure 5A and B. This is because the more conserved 125 has only two
positive scoring letters (D=3.4, Q=2.6), while position 126 has five (C=2.0, A=1.8, S=1.7, T=0.8, M=0.04). The stacking order in Figure 5C may differ
from the order in Figures 5A and 5B. This is because relative letter height in 5A and 5B depends only on the frequency in the distribution $\vec{p}$, whereas
letter height in 5C depends on the score, which accounts for the background distribution, $s_i := \log_2(p_i/q_i)$.
Total stack height is computed in the typical fashion (information
content), and the height of the stack is subdi-
vided according to the relative probabilities of the 
positive-scoring letters, as shown in Figure 5B. In the 
interactive web logo, clicking a column reveals a list of 
the probabilities of observing each letter at that position 
(both above- and below-background letters). 
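
One way to read the above-background variant is sketched below in Python. It assumes that "relative probabilities" means renormalizing the probabilities of the displayed letters so that their heights still sum to the information content. That interpretation, the function name, and the example column are assumptions, not a description of Skylign's code.

```python
import math

def above_background_heights(p, q, alphabet):
    """Divide the information-content stack among above-background letters only."""
    stack = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    shown = [(a, pi) for a, pi, qi in zip(alphabet, p, q) if pi > qi]
    shown_mass = sum(pi for _, pi in shown)
    if shown_mass == 0:  # column matches the background: nothing to draw
        return {}
    # Assumption: heights are proportional to probability among shown letters.
    return {a: stack * pi / shown_mass for a, pi in shown}

# Hypothetical DNA column under a uniform background.
p = [0.6, 0.3, 0.05, 0.05]
heights = above_background_heights(p, [0.25] * 4, "ACGT")
for letter, height in heights.items():
    print(f"{letter}: {height:.3f} bits")
```

Only A and C are drawn in this example, and A is twice as tall as C because its probability is twice as large. G and T fall below the background and are omitted from the stack.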

Skylign also offers an option to produce a different 
sort of logo in which the height of each letter is its score, 
$s_i$. Only positive-scoring letters are included in the stack,
as demonstrated in Figure 5C. We find this logo useful,
for example, when inspecting per-position scores of an 
alignment of a sequence to a profile HMM. It is import- 
ant to emphasize that the height of a score stack does 
not have any inherent meaning - it is simply the sum of 
all letter heights. In the interactive web logo, clicking a 
column reveals a list of scores for all letters of the alpha- 
bet, including those with negative scores. 
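
The point that score-stack height carries no inherent meaning can be reproduced with a toy example. The Python sketch below uses made-up DNA columns and a uniform background; the function names are hypothetical.

```python
import math

def score_heights(p, q, alphabet):
    """Letter height is the log-odds score log2(p_i / q_i); keep positive scores only."""
    scores = {a: math.log2(pi / qi) for a, pi, qi in zip(alphabet, p, q) if pi > 0}
    return {a: s for a, s in scores.items() if s > 0}

def information_content(p, q):
    """Relative entropy D(p || q) in bits, for comparison with score stacks."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25] * 4
conserved = [0.97, 0.01, 0.01, 0.01]  # hypothetical, nearly invariant column
split = [0.5, 0.5, 0.0, 0.0]          # hypothetical, two equally likely letters

for name, p in (("conserved", conserved), ("split", split)):
    stack = sum(score_heights(p, uniform, "ACGT").values())
    ic = information_content(p, uniform)
    print(f"{name}: score stack = {stack:.2f}, information content = {ic:.2f} bits")
```

The split column has the taller score stack even though its information content is lower, mirroring the comparison of positions 125 and 126 in Figure 5C.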

Conclusion 

Logos have long been used to visually represent the 
position-specific patterns of conservation in sequence 
alignments and profile HMMs. We developed Skylign 
with the aim of enabling interactive manipulation and in- 
spection of logos, while offering a variety of logo variants 
for alignments and profiles. The result is a logo tool that 
supports scrolling, zooming, inspection of underlying 
values, and mapping between logo positions and align- 
ment columns. Skylign simplifies the representation of 
gap parameters, offers alternate calculations to deter- 
mine letter heights, and can overcome sampling bias by 
down-weighting redundant sequences and by combining 
observed counts with informed priors. 

Skylign's interactive logos are easily incorporated into a
web page, and we have already used them in our HMMER
and Dfam webservers, presenting logos for both protein
and DNA profile HMMs [10,27]. We anticipate that Skylign
will be used to create logos, either in advance or on
the fly, for other sites that present data related to multiple 
sequence alignments or profile HMMs. 

Availability and requirements 

Skylign can be accessed as a web server and web service, 
and may be downloaded for local use at http://skylign.org. 